EXAMPLES

Example 1: Development of a Search Algorithm Useful for the Identification of
Aspartyl Proteases, and Identification of C. elegans Aspartyl Protease
Genes in Wormpep12:

Materials and Methods: 

Classical aspartyl proteases such as pepsin and renin possess a two-domain structure
which folds to bring two aspartyl residues into proximity within the active site. These are
embedded in the short tripeptide motif DTG, or more rarely, DSG. The DTG or DSG
active site motif appears at about residue 25-30 in the enzyme, but at about 65-70 in the
proenzyme (prorenin, pepsinogen). This motif appears again about 150-200 residues
downstream. The proenzyme is activated by cleavage of the N-terminal prodomain. This
pattern exemplifies the double domain structure of the modern day aspartyl enzymes which
apparently arose by gene duplication and divergence. Thus:

NH2 ------ X ------ D(25)TG ------ Y ------ D(Y+25)TG ------ C

where X denotes the beginning of the enzyme, following the N-terminal prodomain, and Y
denotes the center of the molecule where the gene repeat begins again.

Retroviral enzymes such as the HIV protease represent only one
half of the two-domain structure of well-known enzymes like pepsin, cathepsin D, renin,
etc. They have no prosegment, but are carved out of a polyprotein precursor containing the
gag and pol proteins of the virus. They can be represented by:

NH2 ------ D(25)TG ------ C100

This "monomer" has only about 100 aa, and so is extremely parsimonious compared to the
other aspartyl protease "dimers", which have on the order of 330 or so aa, not counting the
N-terminal prodomain.

The limited length of the eukaryotic aspartyl protease active site motif makes it
difficult to search EST collections for novel sequences. EST sequences typically average
250 nucleotides (about 83 codons), and so would be unlikely to span both aspartyl protease
active site motifs, which lie about 150-200 residues apart. Instead, we turned to the
C. elegans genome. The C. elegans genome is
estimated to contain around 13,000 genes. Of these, roughly 12,000 have been sequenced
and the corresponding hypothetical open reading frames (ORFs) have been placed in the
database Wormpep12. We used this database as the basis for a whole genome scan of a
higher eukaryote for novel aspartyl proteases, using an algorithm that we developed
specifically for this purpose. The following AWK script for locating proteins containing two
DTG or DSG motifs was used for the search, which was repeated four times to recover all
pairwise combinations of the aspartyl motif.

BEGIN { RS = ">" }    # defines ">" as record separator for FASTA format

{
    pos = index($0, "DTG")                       # finds "DTG" in record
    if (pos > 0) {
        rest = substr($0, pos + 3)               # get rest of record after first DTG
        pos2 = index(rest, "DTG")                # find second DTG
        if (pos2 > 0) printf("%s%s\n", ">", $0)  # report hits
    }
}

The AWK script shown above was used to search Wormpep12, which was
downloaded from ftp.sanger.ac.uk/pub/databases/wormpep, for sequence entries containing
at least two DTG or DSG motifs. Using AWK limited each record to 3000 characters or
less. Thus, 35 or so larger records were eliminated manually from
Wormpep12, as in any case these were unlikely to encode aspartyl proteases.
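
By way of illustration only, the four separate passes and the manual screening of long records described above might be combined as in the following sketch. It is not the script used in this Example. The shell function names long_records and two_motif_hits and the LIMIT variable are hypothetical, and the sketch assumes a POSIX awk that does not impose the 3000-character record limit. The character class D[ST]G matches DTG and DSG alike, so a single pass covers all four pairwise combinations. Only the sequence lines are searched, not the header line, and wrapped sequence lines are joined before matching.

```shell
# Hypothetical helper: list headers of FASTA records longer than LIMIT
# characters (default 3000, the record limit mentioned in the text).
long_records() {
    awk -v limit="${LIMIT:-3000}" 'BEGIN { RS = ">" }
    {
        nl = index($0, "\n")            # end of the header line
        if (nl == 0) next               # skip empty or header-only records
        if (length($0) > limit) print substr($0, 1, nl - 1)
    }' "$@"
}

# Hypothetical helper: print the header of every FASTA record whose
# sequence contains at least two D[ST]G motifs (DTG or DSG, any pairing).
two_motif_hits() {
    awk 'BEGIN { RS = ">" }
    {
        nl = index($0, "\n")
        if (nl == 0) next
        header = substr($0, 1, nl - 1)
        seq = substr($0, nl + 1)
        gsub(/[^A-Za-z]/, "", seq)      # join wrapped sequence lines
        n = gsub(/D[ST]G/, "&", seq)    # count motifs, leaving seq unchanged
        if (n >= 2) print ">" header
    }' "$@"
}
```

Each function reads FASTA text from the named files or, if none are given, from standard input. A header line that itself contains ">" would be split across records, so the sketch assumes headers free of that character.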

Results and Discussion: 

The Wormpep12 database contains 12,178 entries, although some of these (<10%)
represent alternatively spliced transcripts from the same gene. Estimates of the number of
genes encoded in the C. elegans genome are on the order of 13,000, so Wormpep12
may be estimated to cover greater than 90% of the C. elegans genome.

Eukaryotic aspartyl proteases contain a two-domain structure, probably arising from
ancestral gene duplication. Each domain contains the active site motif D(S/T)G located
from 20-25 amino acid residues into each domain. The retroviral (e.g., HIV protease) or
retrotransposon proteases are homodimers of subunits which are homologous to a single
eukaryotic aspartyl protease domain. An AWK script was used to search the Wormpep12
database for proteins in which the D(S/T)G motif occurred at least twice. This identified
>60 proteins with two DTG or DSG motifs. Visual inspection was used to select proteins
in which the position of the aspartyl domains was suggestive of a two-domain structure
meeting the criteria described above.
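
The visual inspection step could be assisted by listing where each motif falls within a sequence. The following is a minimal sketch, assuming FASTA input and a POSIX awk; the shell function name motif_positions and the output format are hypothetical and chosen here purely for illustration.

```shell
# Hypothetical helper: for each FASTA record containing a D[ST]G motif,
# print the header followed by the 1-based position of every motif.
motif_positions() {
    awk 'BEGIN { RS = ">" }
    {
        nl = index($0, "\n")
        if (nl == 0) next                    # skip empty or header-only records
        header = substr($0, 1, nl - 1)
        seq = substr($0, nl + 1)
        gsub(/[^A-Za-z]/, "", seq)           # join wrapped sequence lines
        out = ""
        offset = 0                           # residues already consumed
        while (match(seq, /D[ST]G/)) {
            out = out " " (offset + RSTART)  # position in the full sequence
            offset += RSTART + 2
            seq = substr(seq, RSTART + 3)    # continue after this motif
        }
        if (out != "") print header ":" out
    }' "$@"
}
```

An entry whose listed positions are separated by roughly 150-200 residues would be a candidate for closer inspection, while the final judgment remains a manual one as described above.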

In addition, the PROSITE eukaryotic and viral aspartyl protease active site pattern
PS00141 was used to search Wormpep12 for candidate aspartyl proteases (Bairoch A.,
Bucher P., Hofmann K., The PROSITE database, its status in 1997, Nucleic Acids Res.
24:217-221 (1997)). This generated an overlapping set of Wormpep12 sequences. Of these,
seven sequences contained two DTG or DSG motifs and the PROSITE aspartyl protease
active site pattern. Of these seven, three were found in the same cosmid clone (F21F8.3,
F21F8.4, and F21F8.7), suggesting that they represent a family of proteins that arose by
ancestral gene duplication. Two other ORFs with extensive homology to F21F8.3, F21F8.4
and F21F8.7 are present in the same gene cluster (F21F8.2 and F21F8.6); however, these
contain only a single DTG motif. Exhaustive BLAST searches with these seven sequences
against Wormpep12 failed to reveal additional candidate aspartyl proteases in the C.
elegans genome containing two repeats of the DTG or DSG motif.

A BLASTX search with each C. elegans sequence against SWISS-PROT, GenPep and
TREMBL revealed that R12H7.2 was the closest worm homologue to the known
mammalian aspartyl proteases, and that T18H9.2 was somewhat more distantly related,
while CEASP1, F21F8.3, F21F8.4, and F21F8.7 formed a subcluster which had the least
sequence homology to the mammalian sequences.

Discussion:

APP, the presenilins, and p35, the activator of cdk5, all undergo intracellular
proteolytic processing at sites which conform to the substrate specificity of the HIV
protease. Dysregulation of a cellular aspartyl protease with the same substrate specificity
might therefore provide a unifying mechanism for causation of the plaque and tangle
pathologies in AD. Therefore, we sought to identify novel human aspartyl proteases. A
whole genome scan in C. elegans identified seven open reading frames that adhere to the
aspartyl protease profile that we had identified. These seven aspartyl proteases probably
comprise the complete complement of such proteases in a simple, multicellular eukaryote.
These include four closely related aspartyl proteases unique to C. elegans which probably
arose by duplication of an ancestral gene. The other three candidate aspartyl proteases
(T18H9.2, R12H7.2 and C11D2.2) were found to have homology to mammalian gene
sequences.